Preface

One of the first steps in the analysis of a new dataset, often as part of data cleaning, typically involves generation of high-level summaries, such as: the numbers of observations and attributes (variables); which variables are predictors and which ones are (or could be) outcomes; what the ranges, distributions, and percentages of missing values in all the variables are; the strength of correlation among the predictors and between the predictors and the outcome(s); etc. It is usually at this stage that we develop our initial intuition about the level of difficulty of the problem and the challenges presented by this particular dataset. This is when (and how) we form our first set of ideas as to how to approach the problem. There are many multivariate methods under the unsupervised learning umbrella that are extremely useful in this setting (they will be introduced later in the course), but first things first: here we will start by loading a few datasets into R and exploring their attributes in the form of univariate (i.e. single-variable) summaries and bivariate (two-variable) plots and contingency tables (where applicable).

For this problem set we will use datasets available from the UCI machine learning repository, or subsets thereof cleaned up and pre-processed for instructional purposes. For convenience, and in order to avoid depending on the availability of the UCI ML repository, we have copied the datasets to the course Canvas website. Once you have downloaded the data onto your computer, they can be imported into R using the function read.table with the necessary options (the most useful/relevant of which include sep, defining the field separator, and header, letting read.table know whether the first line in the text file contains the column names or the first row of data). In principle, read.table can also accept a URL as the full path to the dataset, but here, to be able to work independently of a network connection and because of the pre-processing involved, you will need to download those datasets from the Canvas website to your local computer and use read.table with appropriate paths to the local files. The simplest approach is probably to copy the data to the same directory where your .Rmd file is, in which case just the file name passed to read.table should suffice. As always, please remember that help(read.table) (or ?read.table as a shorthand) will tell you quite a bit about this function and its parameters.
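For instance, a minimal sketch along these lines (the file name my-data.csv here is hypothetical; substitute the actual file you downloaded):

```r
# Hypothetical example: read a comma-separated file with a header line,
# located in the same directory as the .Rmd file being knitted:
myDat <- read.table("my-data.csv", sep=",", header=TRUE)
dim(myDat)   # numbers of observations (rows) and attributes (columns)
```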

For those datasets that do not have column names included in their data files, it is often convenient to assign them explicitly. Please note that for some of these datasets categorical variables are encoded in the form of integer values (e.g. 1, 2, 3 and 4) and thus R will interpret those as continuous variables by default, while the behavior of many R functions depends on the type of the input variables (continuous vs categorical/factor).
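As an illustration of the point above (using a small made-up data frame, not one of the course datasets), an explicit conversion to factor changes how such a variable is treated:

```r
# 'grp' is encoded as integers 1,2,3,4 but is really categorical:
tmpDat <- data.frame(x=rnorm(8), grp=rep(1:4, 2))
summary(tmpDat$grp)        # treated as continuous: quartiles, mean, etc.
tmpDat$grp <- factor(tmpDat$grp)
summary(tmpDat$grp)        # now counts of occurrences of each level
```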

The code excerpts and their output presented below illustrate some of these most basic steps as applied to one of the datasets available from UCI. The homework problems follow after that – they will require you to apply similar approaches to generate high-level summaries of a few other datasets that are provided to you here.

Haberman Survival Dataset

Note how the summary function computes a 5-number summary (plus the mean) for a continuous variable, cannot do anything particularly useful for a general vector of strings, and counts the numbers of occurrences of the distinct levels of a categorical variable (explicitly defined as a factor).

habDat <- read.table("haberman.data",sep=",")
colnames(habDat) <- c("age","year","nodes","surv")
summary(habDat$surv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.265   2.000   2.000
habDat$surv <- c("yes","no")[habDat$surv]
summary(habDat$surv)
##    Length     Class      Mode 
##       306 character character
habDat$surv <- factor(habDat$surv)
summary(habDat$surv)
##  no yes 
##  81 225

Below we demonstrate xy-scatterplots of two variables (patient’s age and node count), with color indicating survival past 5 years. The first example uses basic plotting capabilities in R, while the second one shows how the same result can be achieved with the ggplot2 package. Note that in this particular example we choose to show the data stratified by the survival categorical variable in two separate scatterplots, side by side. The reason is a purely aesthetic one: since we do indicate the distinct classes of patients with different colors, it would be entirely possible (and meaningful) to put all the data into a single scatterplot, the same way it was done in class. However, the data at hand do not readily separate into (visually) distinct groups, at least in the projection onto the two variables chosen here. There would be too much overplotting (exacerbated by the fact that node counts and years take on integer values only), and it would be more difficult to notice that the subset of data shown on the right (survival=yes) is in fact much more dense near nodes=0. It is certainly OK to use the type of visualization that provides the most insight into the data.

oldPar <- par(mfrow=c(1,2),ps=16)
for ( iSurv in sort(unique(habDat$surv)) ) {
    plot(habDat[,c("age","nodes")],type="n",
        main=paste("Survival:",iSurv))
    iTmp <- match(iSurv, levels(habDat$surv))
    points(habDat[habDat$surv==iSurv,c("age","nodes")],col=iTmp,pch=iTmp)
}

par(oldPar)
ggplot(habDat,aes(x=age,y=nodes,colour=surv,shape=surv)) + 
geom_point() + facet_wrap(~surv) + theme_bw()

It seems that a higher number of nodes might be associated with a lower probability of survival: note that despite the fact that both survival outcomes, yes and no, include patients with large node counts, and the distributions above a node count of ~10 look pretty much the same (and structureless too), the survival=yes outcome clearly has a much higher fraction of low node count cases, as expected. One attempt to quantify this relationship might involve testing the relationship between the indicator of survival and indicators of node count exceeding arbitrarily chosen cutoffs (e.g. zero, or the 75th percentile, as shown in the example below).
In the code example we first generate a 2-way matrix that cross-tabulates the counts of cases for all combinations of survival yes/no and node count zero/non-zero values. As you can see, when nodes=0 is true, the survival yes/no outcomes are split as 117/19, while for the subset of cases where nodes=0 is false, the survival yes/no values are split as 108/62, which is certainly much worse (you can check that this difference in survival probability is indeed statistically significant; which statistical test would you use for that?). The second part of the code performs pretty much the same task, except that we stratify the patients with respect to node counts being above or below the 75th percentile, instead of being zero or non-zero:

habDat$nodes0 <- habDat$nodes==0
table(habDat[, c("surv","nodes0")])
##      nodes0
## surv  FALSE TRUE
##   no     62   19
##   yes   108  117
habDat$nodes75 <- habDat$nodes>=quantile(habDat$nodes,probs=0.75)
table(habDat[, c("surv","nodes75")])
##      nodes75
## surv  FALSE TRUE
##   no     39   42
##   yes   178   47
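As for the statistical test mentioned above, one reasonable choice for a 2×2 contingency table such as this (presented here as a sketch, not the only valid option) is Fisher's exact test:

```r
# Fisher's exact test of association between survival and zero/non-zero
# node count, using the contingency table generated above:
fisher.test(table(habDat[, c("surv","nodes0")]))
```

A chi-squared test (chisq.test) on the same table would be another common choice.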

Please feel free to model your solutions after the examples shown above, while exercising necessary judgement as to which attributes are best represented as continuous and which ones should be represented as categorical, etc. The descriptions of homework problems provide some guidance as to what is expected, but leave some of those choices up to you. Making such calls is an integral part of any data analysis project and we will be working on advancing this skill throughout this course.

Lastly – do ask questions! Piazza is the best place for that.

Wireless Indoor Localization Data Set (30 points)

This dataset presents an example of a classification problem (room identity) using continuous predictors derived from the strengths of several WiFi signals on a smartphone. More details about the underlying data can be found in the corresponding dataset description at the UCI ML website. To load the data into R please use the data file wifi_localization.txt available both at the course website and in the UCI ML dataset repository.

Once the dataset is loaded into R, please name the dataset attributes (variables) appropriately, determine the number of variables (explain which ones are predictors and which one is the outcome) and observations in the dataset (R functions such as dim, nrow, ncol could be useful for this), generate a summary of the data using the summary function in R, and generate pairwise XY-scatterplots of each pair of continuous predictors, while indicating the outcome using colour and/or shape of the symbols (you may find it convenient to use the pairs plotting function). Describe your observations and discuss which of the variables are more likely to be informative with respect to discriminating these rooms (this literally means: just by looking at the plots, for the lack of better methods that we have not developed just yet, which variables do you think will be more useful for letting us tell which room the smartphone is in).

wifiDat = read.table("wifi_localization.txt",sep="")
colnames(wifiDat) <- c("A","B","C","D","E","F","G","Room")
wifiDat$Room = factor(wifiDat$Room)

head(wifiDat)
##     A   B   C   D   E   F   G Room
## 1 -64 -56 -61 -66 -71 -82 -81    1
## 2 -68 -57 -61 -65 -71 -85 -85    1
## 3 -63 -60 -60 -67 -76 -85 -84    1
## 4 -61 -60 -68 -62 -77 -90 -80    1
## 5 -63 -65 -60 -63 -77 -81 -87    1
## 6 -64 -55 -63 -66 -76 -88 -83    1
tail(wifiDat)
##        A   B   C   D   E   F   G Room
## 1995 -61 -54 -51 -63 -44 -87 -88    4
## 1996 -59 -59 -48 -66 -50 -86 -94    4
## 1997 -59 -56 -50 -62 -47 -87 -90    4
## 1998 -62 -59 -46 -65 -45 -87 -88    4
## 1999 -62 -58 -52 -61 -41 -90 -85    4
## 2000 -59 -50 -45 -60 -45 -88 -87    4
nrow(wifiDat)
## [1] 2000
ncol(wifiDat)
## [1] 8
summary(wifiDat$A)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -74.00  -61.00  -55.00  -52.33  -46.00  -10.00
summary(wifiDat$Room)
##   1   2   3   4 
## 500 500 500 500
op <- par(mfrow=c(2,4),ps=16)

for (n in colnames(wifiDat[1:7])){
  
  hist(wifiDat[[n]],main=paste("Hist",n))
  
}

par(op)

boxplot(wifiDat)

colors <- c("blue", "green", "red", "yellow")  
pairs(wifiDat,col = colors[wifiDat$Room],lower.panel=NULL)

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(wifiDat,progress=F,mapping = ggplot2::aes(color = Room))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, please comment on whether, given the data at hand, the problem of detecting room identity on the basis of the strength of the WiFi signal appears to be an easy or a hard one to solve. Try guessing, using your best intuition, what the error in predicting room identity in this dataset could be: 50%, 20%, 10%, 5%, 2%, less than that? Later in the course we will work with this dataset again to actually develop such a classifier, and at that point you will get a quantitative answer to this question. For now, what we are trying to achieve is to make you think about the data and to provide a (guesstimate) answer just from visual inspection of the scatterplots. Thus, there is no wrong answer at this point, just try your best, explain your (qualitative) reasoning, and make a note of your answer, so you can go back to it several weeks later.

Finally, please reflect on the potential usage of such a model (predicting room identity on the basis of WiFi signal strength) and discuss some of the limitations that the predictive performance of such a model may impose on its utility. Suppose that we can never achieve perfect identification (it’s statistics after all), so we will end up with some finite error rate. For instance, if this model were integrated into a “smart home” setup that turns the light on or off depending on which room the smartphone is in, how useful would such a model be if its error rate was, say, 1%, 10% or 50%? Can you think of alternative scenarios where this type of model could be used which would impose stricter or more lax requirements on its predictive performance? Once again, the goal here is to prompt you to consider bigger picture aspects that would impact the utility of the model – there is hardly a right or wrong answer to this question, but please do present some kind of summary of your thoughts on this topic, even if in a couple of sentences.
For each room, each signal has a fairly normal distribution centered around a certain magnitude. For example, for Room 1, the strength of signal A is centered around -60, and that of signal G around -85. For Room 2, signal A is centered around -40, and signal G around -75.

There is also correlation between certain signals. Signals A and D have a very high correlation of 0.921. This high correlation suggests that the information provided by the two signals may be redundant for differentiating the rooms.
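The correlations quoted above can be computed directly with cor, e.g.:

```r
# Pairwise correlations among the seven signal strength predictors:
round(cor(wifiDat[, 1:7]), 3)
# or for a single pair:
cor(wifiDat$A, wifiDat$D)
```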

The more useful signals for identifying rooms would have distinct values for each room. Signals C, D and E tend to have more distinct values per room. However, their distributions overlap, so we may encounter cases where the C signal is identical in Room 1 and Room 2, and we would not be able to tell which room we are in based on this signal alone. We would also need signals that better differentiate between Rooms 1 and 2, such as A or E.

Using this set of data, our error in identifying a room should be less than 20%, or perhaps 10%. Each room has a fairly distinct signature across the 7 signals that can be used to identify it. If all 7 signals were identical for every room, then our prediction error would be very high, closer to 50%.

We can use a model built from this data to enable a smart home system to turn lights on and off based on the location of our phone. The error rate should be better than 10%, or even 1%, for such a system to be good, so that the lights do not turn off while we are working on a task in a room.

WiFi signals are already used to improve GPS locations on our phones. In a similar fashion, our phone could measure its signal strengths from nearby hotspots and send them to a server that interprets the data based on past statistical measurements and sends the location back to the phone. As we are not looking for an exact location within inches, a higher amount of error is tolerable; being within 10 feet or even more would be OK.

Amount of Fund Raising Contributions (30 points)

This dataset presents an example of a regression problem – predicting the dollar amount of donors’ contributions from a direct mail campaign based on their demographics and history of past contributions. This dataset is a cleaned-up subset of one of the datasets used in data mining competitions in the late 90s, and it comes with the requirement of describing its source in rather broad (i.e. non-specific) terms if or when it is used for educational purposes. To load the data into R please use the file fund-raising.csv available at the course website in Canvas. More details about the data attributes can be found in the corresponding file fund-raising-notes.txt, also available from our course website in Canvas.

Once the dataset is loaded into R, please determine the number of variables (explain which ones are predictors – categorical vs. continuous – and which one is the outcome) and observations in the dataset (R functions such as dim, nrow, ncol could be useful for this), generate a summary of the data using the summary function in R, and generate pairwise XY-scatterplots of each pair of continuous attributes.

library("zoo")
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
fundDat = read.table("fund-raising.csv",sep=",",header = T)

cat('rows:',nrow(fundDat),"\n")
## rows: 3470
cat('cols:',ncol(fundDat))
## cols: 13
# mindate/maxdate are encoded as 4-digit YYMM values (two-digit year followed by
# two-digit month); insert a space between them and parse with zoo::as.yearmon:
fundDat$mindate = as.Date( as.yearmon( gsub('(.{2})', '\\1 ', fundDat$mindate) , '%y %m' ) )
fundDat$maxdate = as.Date( as.yearmon( gsub('(.{2})', '\\1 ', fundDat$maxdate) , '%y %m' ) )
fundDat$gender = factor(fundDat$gender)

summary(fundDat)
##     contrib           gapmos         promocontr      mincontrib    
##  Min.   :  1.00   Min.   : 0.000   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 10.00   1st Qu.: 4.000   1st Qu.: 3.00   1st Qu.: 3.000  
##  Median : 13.00   Median : 6.000   Median : 6.00   Median : 5.000  
##  Mean   : 15.29   Mean   : 7.819   Mean   : 6.57   Mean   : 5.607  
##  3rd Qu.: 20.00   3rd Qu.:10.000   3rd Qu.: 9.00   3rd Qu.: 5.000  
##  Max.   :200.00   Max.   :77.000   Max.   :29.00   Max.   :80.000  
##     ncontrib       maxcontrib        lastcontr        avecontr      
##  Min.   : 1.00   Min.   :   5.00   Min.   :  0.0   Min.   :  2.261  
##  1st Qu.: 6.00   1st Qu.:  11.00   1st Qu.: 10.0   1st Qu.:  7.100  
##  Median :10.00   Median :  15.00   Median : 14.0   Median :  9.894  
##  Mean   :12.34   Mean   :  18.08   Mean   : 14.8   Mean   : 11.162  
##  3rd Qu.:16.00   3rd Qu.:  20.00   3rd Qu.: 20.0   3rd Qu.: 13.197  
##  Max.   :91.00   Max.   :1000.00   Max.   :250.0   Max.   :103.571  
##     mailord           mindate              maxdate                age       
##  Min.   :  0.000   Min.   :1986-08-01   Min.   :1983-12-01   Min.   : 4.00  
##  1st Qu.:  0.000   1st Qu.:1990-07-01   1st Qu.:1994-01-01   1st Qu.:50.00  
##  Median :  1.000   Median :1992-10-01   Median :1995-04-01   Median :64.00  
##  Mean   :  4.439   Mean   :1992-06-21   Mean   :1994-07-13   Mean   :62.63  
##  3rd Qu.:  5.000   3rd Qu.:1994-10-01   3rd Qu.:1995-11-01   3rd Qu.:75.00  
##  Max.   :240.000   Max.   :1997-02-01   Max.   :1997-02-01   Max.   :98.00  
##  gender  
##  F:1871  
##  M:1475  
##  U: 124  
##          
##          
## 
levels(fundDat$gender)
## [1] "F" "M" "U"
head(fundDat)
##   contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr  avecontr
## 1       4     12         10          2       15          7         5  4.066667
## 2       5      3         14          3       21          6         5  4.857143
## 3      13     21          5          5       12         17        10 11.000000
## 4      10      6          8          5       10         12        12  9.400000
## 5      10      7          2         10        3         15        10 11.666667
## 6      20      3         16          5       26         15         7  9.576923
##   mailord    mindate    maxdate age gender
## 1      10 1988-01-01 1994-04-01  62      F
## 2       5 1993-12-01 1994-04-01  66      F
## 3       0 1990-01-01 1995-03-01  69      F
## 4      10 1992-09-01 1995-09-01  73      M
## 5       0 1995-11-01 1995-08-01  58      F
## 6      15 1995-05-01 1987-09-01  85      M
tail(fundDat)
##      contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr
## 3465      10      1          8       5.00       11         10        10
## 3466      20      7         16       0.07       30         17        17
## 3467      15      4          2       5.00        3         15        15
## 3468       3     15          4       5.00       10         25        20
## 3469      10     10          6       3.00       12         20        20
## 3470      18      4         18       5.00       41         21        18
##       avecontr mailord    mindate    maxdate age gender
## 3465  7.272727       1 1994-12-01 1994-10-01  72      M
## 3466  7.935667       0 1989-06-01 1996-01-01  45      F
## 3467 11.666667       1 1993-10-01 1994-02-01  51      F
## 3468 14.400000       0 1989-06-01 1995-11-01  86      F
## 3469 11.583333       0 1990-03-01 1993-12-01  58      F
## 3470 12.146341       0 1990-11-01 1996-08-01  58      F
fundDat[fundDat$contrib > 150, ] 
##      contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr
## 1242     200     13          5         25        8         50        50
## 2265     200      6         17          2       25        250       250
##      avecontr mailord    mindate    maxdate age gender
## 1242   40.875       1 1994-06-01 1993-09-01  91      M
## 2265   12.760      42 1994-08-01 1996-01-01  82      M
fundDat[fundDat$maxcontrib > 900, ]
##      contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr
## 2686       8      5         15          3       68       1000         5
##      avecontr mailord    mindate    maxdate age gender
## 2686 23.85294       0 1989-09-01 1990-08-01  92      F
op = par(mfrow=c(1,2))

boxplot(fundDat,las=2)

boxplot(fundDat[,c(-2,-5,-6,-9,-10,-11,-12,-13)],las=2)

boxplot(fundDat[,c(10,11)],las=2)

boxplot(fundDat[,c('promocontr','ncontrib','mailord')],las=2)

hist(fundDat$gapmos)

hist(fundDat$age)

barplot(table(fundDat$gender),main="Gender")

par(op)

rbPal = colorRampPalette(c('violet','blue','green','yellow','orange','red'))
colors = rbPal(200) 
pairs(fundDat,col = colors[ fundDat$contrib ],lower.panel=NULL)

library(GGally)

ggpairs(fundDat[1:13],progress=F)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Describe your observations and discuss which attributes might be more useful for predicting the outcome as defined in the fund-raising-notes.txt dataset description.

Try being creative: visualizing and discussing potential associations between each individual (continuous) predictor variable and the continuous outcome is relatively straightforward. But this time around you cannot really use the outcome to stratify the points in the pairwise predictor-predictor scatterplots the same way we did it in Problem 1: there we had just four possible values of the (categorical) outcome, but how many distinct values of the donor contribution do we have now? Do the plots make much sense, and are they even interpretable, if you use all those distinct values of the contributed amount as they are? Is there a way around this?
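One possible workaround (a sketch, not the only option) is to discretize the continuous outcome into a small number of quantile-based categories and use those categories to color the points:

```r
# Bin the contribution amount into quartile-based categories, then use them
# to color a pairwise scatterplot of a few (arbitrarily chosen) predictors:
contribQ <- cut(fundDat$contrib,
                breaks=quantile(fundDat$contrib, probs=seq(0, 1, 0.25)),
                include.lowest=TRUE)
pairs(fundDat[, c("lastcontr","avecontr","age")],
      col=as.numeric(contribQ), lower.panel=NULL)
```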
In the pairs plot, the contribution amount is mapped to the color gradient, where red is highest and violet is lowest.

There are several outliers in the data. For max contribution, there is one data point at 1000, whereas the others are below 200. There is also one last contribution of 250, while the rest are below 150. Generally, the number of high contributors is very small, around 1–3 people.

The last contribution appears to be a good predictor of future contributions. Generally, those with low past contributions have low current contributions, and high past contributions go with high current contributions.

Gap months also have some predictive value: higher contributions correlate with lower gap months, while for low contributions there is less of a relationship, as the number of cases is fairly even across the gap-month range.

Similarly, higher contributions tend to come from higher ages, while lower contributions come from all age ranges.

There is also a correlation between the number of responses to promotional mailings (promocontr) and the total number of contributions (ncontrib).

For extra 5 points generate boxplots for some of the continuous vs categorical predictors, rendering potential relationships between them.

op = par(mfrow=c(2,2))

#boxplot(contrib~gender,data=fundDat)
boxplot(contrib~lastcontr,data=fundDat)
boxplot(contrib~avecontr,data=fundDat)
boxplot(ncontrib~promocontr,data=fundDat)
boxplot(mincontrib~avecontr,data=fundDat)

par(op)

The plots show some correlation between contrib~lastcontr, contrib~avecontr, ncontrib~promocontr, and mincontrib~avecontr.

Tibbles (extra 5 points)

Fluency in R (as in any other programming language) involves the ability to look up, understand, and put to use as necessary new functionality that has not been explored before. One of the relatively recent additions to R are the so-called tibbles, which can be seen as a “modern take on data frames”. To earn the extra points offered by this problem, please look up tibble usage and contrast their behavior to that of conventional data frames using one of the datasets you have already created above. To earn all points available your solution must include more than one example of substantive differences (the same kind of difference illustrated by two datasets counts as one example). Please also comment (briefly is fine) on why the use of tibbles may result in more robust code (or not, it’s fine if you happen to find tibbles to be in fact clunkier and not resulting in cleaner code – but you have to argue your point, either way).

library("readr")
library("tibble")

tib_fund = read_delim("fund-raising.csv",",",col_names = T)
## Parsed with column specification:
## cols(
##   contrib = col_double(),
##   gapmos = col_double(),
##   promocontr = col_double(),
##   mincontrib = col_double(),
##   ncontrib = col_double(),
##   maxcontrib = col_double(),
##   lastcontr = col_double(),
##   avecontr = col_double(),
##   mailord = col_double(),
##   mindate = col_double(),
##   maxdate = col_double(),
##   age = col_double(),
##   gender = col_character()
## )
class(tib_fund)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
head(tib_fund)
## # A tibble: 6 x 13
##   contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr avecontr
##     <dbl>  <dbl>      <dbl>      <dbl>    <dbl>      <dbl>     <dbl>    <dbl>
## 1       4     12         10          2       15          7         5     4.07
## 2       5      3         14          3       21          6         5     4.86
## 3      13     21          5          5       12         17        10    11   
## 4      10      6          8          5       10         12        12     9.4 
## 5      10      7          2         10        3         15        10    11.7 
## 6      20      3         16          5       26         15         7     9.58
## # … with 5 more variables: mailord <dbl>, mindate <dbl>, maxdate <dbl>,
## #   age <dbl>, gender <chr>
df_fund = read.table("fund-raising.csv",sep=",",header = T)

class(df_fund)
## [1] "data.frame"
head(df_fund)
##   contrib gapmos promocontr mincontrib ncontrib maxcontrib lastcontr  avecontr
## 1       4     12         10          2       15          7         5  4.066667
## 2       5      3         14          3       21          6         5  4.857143
## 3      13     21          5          5       12         17        10 11.000000
## 4      10      6          8          5       10         12        12  9.400000
## 5      10      7          2         10        3         15        10 11.666667
## 6      20      3         16          5       26         15         7  9.576923
##   mailord mindate maxdate age gender
## 1      10    8801    9404  62      F
## 2       5    9312    9404  66      F
## 3       0    9001    9503  69      F
## 4      10    9209    9509  73      M
## 5       0    9511    9508  58      F
## 6      15    9505    8709  85      M

When reading in data, read_delim (tibbles) reports the class assigned to each column, whereas read.table (data frames) does not. The tibble reader also imports all numeric columns as double, while read.table imported some of them as double and some as integer, depending on the presence of a decimal point. The tibble reader imports gender as character, while read.table imports it as a factor. So tibbles make fewer assumptions about the data being imported, while read.table guesses the class of each column based on what was read.

With tibbles, we can be more confident that columns containing numbers will always be of class numeric and those containing letters of class character, which makes the code more robust, especially when we are writing many lines of code without checking the output line by line. With data frames, we cannot be sure what class each column will have until after the data is read. For those used to data frames, tibbles may be less convenient and involve more manual class conversions. For those who prefer to leave column classes untouched in the absence of explicit instructions, tibbles may be a better alternative.
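For completeness, read.table can be made just as deterministic by specifying column classes explicitly (a sketch; the colClasses vector below assumes the 12-numeric-columns-plus-gender layout of this particular file):

```r
# Force read.table to use predetermined column classes instead of guessing:
df_fund2 <- read.table("fund-raising.csv", sep=",", header=TRUE,
                       colClasses=c(rep("numeric", 12), "character"),
                       stringsAsFactors=FALSE)
```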

tib_fund['gapmos']
## # A tibble: 3,470 x 1
##    gapmos
##     <dbl>
##  1     12
##  2      3
##  3     21
##  4      6
##  5      7
##  6      3
##  7     11
##  8      4
##  9     16
## 10      7
## # … with 3,460 more rows
class(tib_fund['gapmos'])
## [1] "tbl_df"     "tbl"        "data.frame"
class(df_fund['gapmos'])
## [1] "data.frame"

A tibble has the classes “tbl_df”, “tbl”, and “data.frame”; a data frame has only the class “data.frame”. Printing a tibble shows at most 10 rows, while printing a data frame shows all rows of data.

tib_fund$gapmos[1:10]
##  [1] 12  3 21  6  7  3 11  4 16  7
df_fund$gapmos[1:10]
##  [1] 12  3 21  6  7  3 11  4 16  7
class(tib_fund$gapmos)
## [1] "numeric"
class(df_fund$gapmos)
## [1] "integer"

The tibble reader considers “gapmos” numeric (double) because it contains only numbers; read.table considers “gapmos” integer because its values have no decimal points.

tib_fund[,'gapmos']
## # A tibble: 3,470 x 1
##    gapmos
##     <dbl>
##  1     12
##  2      3
##  3     21
##  4      6
##  5      7
##  6      3
##  7     11
##  8      4
##  9     16
## 10      7
## # … with 3,460 more rows
df_fund[,'gapmos'][1:10]
##  [1] 12  3 21  6  7  3 11  4 16  7
class(tib_fund[,'gapmos'])
## [1] "tbl_df"     "tbl"        "data.frame"
class(df_fund[,'gapmos'])
## [1] "integer"

Using the analogous command x[,colname], a tibble returns a one-column tibble (tbl_df), while a data frame returns a vector.

tib_fund[['gapmos']][1:5]
## [1] 12  3 21  6  7
df_fund[['gapmos']][1:5]
## [1] 12  3 21  6  7

Both the tibble and the data frame output a vector with this command.

rbPal = colorRampPalette(c('violet','blue','green','yellow','orange','red'))
colors = rbPal(200) 

tib_fund$gender = factor(tib_fund$gender)

pairs(tib_fund,col = colors[ tib_fund$contrib ],lower.panel=NULL)

A tibble can be used to create a pairs plot and a histogram just like a data frame can.

hist(tib_fund$gapmos)